In [23]:
%matplotlib inline
from scipy.stats import binom, norm, poisson, hypergeom, rv_discrete
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
By the end of this section, you will be able to:
- simulate data under a null model to compute a p-value,
- use a permutation test when you cannot justify a parametric null model,
- fit a distribution's parameters to a sample of data,
- explain what the central limit theorem says, and demonstrate it by simulation.
In the first part, we learned about descriptive statistics and common plots used to visualize data.
In this section, we will learn about inferential statistics, which, boiled down to its core, is all about figuring out how probable our data (or something more extreme) would have been under random chance.
We observe some data. We want to know the probability that we could have observed that data, or something more extreme, purely by random chance.
To do so, we need a few ingredients:
- the observed data,
- a null model of what could happen under pure random chance,
- a way to compute the probability of seeing data at least as extreme under that null model.
Inspired by @jakevdp (Jake Vanderplas, UW eScience Institute), we will not be using analytical methods (i.e., algebra) in this segment; instead, we will use computational methods.
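Here is a sketch of the general recipe we will follow throughout; simulated_p_value is just an illustrative helper defined here for clarity, not a library function:
In [ ]:
# The generic computational recipe for a p-value:
#   1. simulate many outcomes under the null model,
#   2. count how often a simulated outcome is at least as extreme as the observed one,
#   3. divide by the number of simulations.
def simulated_p_value(null_rvs, observed, n_sims=10000):
    """null_rvs: a function returning one simulated statistic under the null."""
    sims = np.array([null_rvs() for _ in range(n_sims)])
    return np.sum(sims >= observed) / n_sims

# e.g. for the coin question below:
# simulated_p_value(lambda: binom.rvs(n=20, p=0.5), 16)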
Here is a data point: we have flipped a coin 20 times, and it came up heads 16 times. Is this coin biased or not?
In [2]:
alpha = 5 / 100  # our significance threshold: 5%
null_flips = binom.rvs(n=20, p=0.5, size=10000)  # simulate 10000 fair-coin experiments of 20 flips
plt.hist(null_flips)  # the null distribution of head counts
plt.axvline(16)  # the observed number of heads
Out[2]:
Our test tells us how often we would see a given number of heads or more under pure randomness. Let us count up the number of times we see 16 or more heads under the null model.
In [3]:
num_heads = sum(null_flips >= 16)  # number of simulations with 16 or more heads
num_heads
Out[3]:
Finally, let's calculate the probability of seeing this number by random chance.
In [4]:
p_value = num_heads / len(null_flips)  # fraction of simulations at least as extreme as the data
p_value
Out[4]:
The probability of seeing this many heads or more under a null model of p=0.5 is very low (roughly 1 in 200). Therefore, it is likely that the coin is biased.
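As a cross-check on our simulation, scipy can compute this tail probability exactly via the binomial survival function:
In [ ]:
# exact P(X >= 16) for X ~ Binomial(n=20, p=0.5); sf(k) = P(X > k), so pass k=15
binom.sf(15, n=20, p=0.5)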
Epidemiologists have been longitudinally measuring the rate of cancer incidence in a population of humans. Over the first 10 years, cancer rates were 10 per 10000 by the age of 30. In the 11th and 12th years, cancer rates went up to 14 per 10000. Calculate the p-value for the 11th year and for the 12th year separately.
Hint: Model this as a Poisson process.
What are you assuming about the null model?
In [94]:
null_poisson = poisson.rvs(10, size=100000)  # simulate 100000 years under the null rate of 10 cases
sum(null_poisson >= 14) / 100000  # p-value: fraction of simulated years with 14 or more cases
Out[94]:
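Again, the exact tail probability is available as a cross-check:
In [ ]:
# exact P(X >= 14) for X ~ Poisson(mu=10); sf(k) = P(X > k), so pass k=13
poisson.sf(13, 10)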
Visualize this using the plot below.
In [6]:
plt.hist(null_poisson)
plt.axvline(14)
Out[6]:
In [32]:
# scipy's parameterization: M = population size, n = number of 'success'
# items in the population, N = number of items drawn
null_hypergeom = hypergeom.rvs(M=100, n=30, N=50, size=10000)
sum(null_hypergeom >= 22) / 10000  # fraction of simulated draws with 22 or more successes
Out[32]:
Once again, plot this below.
In [33]:
plt.hist(null_hypergeom)
plt.axvline(22)
Out[33]:
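Once more, the exact tail probability as a cross-check:
In [ ]:
# exact P(X >= 22) for X ~ Hypergeometric(M=100, n=30, N=50)
hypergeom.sf(21, M=100, n=30, N=50)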
Systolic blood pressure in healthy patients is Normally distributed with mean 122 mmHg and standard deviation 9 mmHg. In hypertensive patients, it is Normally distributed with mean 140 mmHg and standard deviation 6 mmHg. (I pulled these numbers out of a hat: http://www.cdc.gov/nchs/data/nhsr/nhsr035.pdf)
A patient comes in and measures 130 mmHg.
What is the p-value of this patient's blood pressure, under:
- the null hypothesis that the patient is healthy?
- the null hypothesis that the patient is hypertensive?
Plot the distributions and the patient's systolic blood pressure.
In [9]:
null_healthy = norm.rvs(122, 9, size=100000)  # simulated healthy population
null_hyper = norm.rvs(140, 6, size=100000)  # simulated hypertensive population
sns.kdeplot(null_healthy, label='healthy')
sns.kdeplot(null_hyper, label='hypertensive')
plt.axvline(130)  # the patient's measurement
plt.legend()
Out[9]:
In [10]:
# P-value calculation under healthy hypothesis
np.sum(null_healthy >= 130) / len(null_healthy)
Out[10]:
In [11]:
# P-value calculation under hypertension hypothesis
np.sum(null_hyper <= 130) / len(null_hyper)
Out[11]:
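The exact tail probabilities under both models, for comparison:
In [ ]:
# P(X >= 130) under Normal(122, 9), and P(X <= 130) under Normal(140, 6)
norm.sf(130, 122, 9), norm.cdf(130, 140, 6)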
In [12]:
# inhibitor1 = norm.rvs(70, 22, size=20)
# inhibitor2 = norm.rvs(72, 10, size=30)
# control = norm.rvs(75, 20, size=15)
# data = pd.DataFrame([inhibitor1, inhibitor2, control]).T
# data.columns = ['Inhibitor 1', 'Inhibitor 2', 'Control']
# data.to_csv('drug_inhibitor.csv')
data = pd.read_csv('drug_inhibitor.csv', index_col=0)
data.mean().plot(kind='bar', yerr=data.std())  # group means with standard-deviation error bars
data.head(10)  # displays only the first 10 rows; take out ".head(10)" to display all of the data
Out[12]:
Okay, so here's the important, potentially billion-dollar market question: Is inhibitor 1 any good?
Basically, the question we're asking is this: is there a difference between the enzyme's activity under inhibitor 1 vs. the control?
Here are a few problems we face:
- We don't know what distribution the enzyme activity measurements follow. (You could assume a Normal N(mu, sd) distribution around the mean of each group, but you'd have to justify why this is a rational choice.)
- The groups are small and of unequal size.

Is there a way around this?
We can perform a permutation test. Here's the setup: if the inhibitor has no effect, then the 'Inhibitor 1' and 'Control' labels are exchangeable, so shuffling the labels should change the difference between the group means only by chance.

Therefore, we can shuffle the labels many times, compute the difference in group means each time to build up a null distribution, and ask where the observed difference falls within it.
In [29]:
# Select the two columns of interest
subset = data[['Inhibitor 1', 'Control']]

differences = []
for i in range(1000):
    # Shuffle values between the two labels within each row,
    # then compute each column's mean under the shuffled labels
    shuffled_means = subset.apply(np.random.permutation, axis=1).mean()
    difference = shuffled_means['Control'] - shuffled_means['Inhibitor 1']
    differences.append(difference)
differences = np.array(differences)

plt.hist(differences, label='Null')  # null distribution of mean differences
plt.axvline(data.mean()['Control'] - data.mean()['Inhibitor 1'],
            color='red', label='Data')  # the observed difference
plt.legend()
Out[29]:
In [14]:
data_diff = data.mean()['Control'] - data.mean()['Inhibitor 1']  # observed difference in means
np.sum(differences >= data_diff) / len(differences)  # p-value under the permutation null
Out[14]:
The probability of seeing a difference this large or larger, assuming that the inhibitor treatment has no effect, is about 3 in 20. We might not be so convinced that this is a good drug.
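As a cross-check (my addition, not part of the original analysis), a rank-based non-parametric test such as the Mann-Whitney U test asks a similar question without assuming any particular distribution. We have to drop the NaN padding that pandas introduced because the groups have unequal sizes:
In [ ]:
from scipy.stats import mannwhitneyu

control = data['Control'].dropna()
inhibitor1 = data['Inhibitor 1'].dropna()
# one-sided test: is Control stochastically greater than Inhibitor 1?
u_stat, p = mannwhitneyu(control, inhibitor1, alternative='greater')
p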
In this section, we will learn about model fitting, and about the central limit theorem.
What is model fitting all about? It's basically about finding the parameter set that best describes your data.
Let's start by taking a random sample of data from a Normal(10, 0.5) distribution.
In [15]:
sample = norm.rvs(10, 0.5, size=6) # we will draw 6 instances from the normal distribution.
sns.kdeplot(sample, color='blue')
Out[15]:
In [27]:
population = norm.rvs(10, 0.5, size=10000)
sns.kdeplot(sample, color='blue', label='sample')
sns.kdeplot(population, color='red', label='population')
plt.legend()
Out[27]:
Create a new distribution called fitted, and tweak its parameters so that it fits the sample distribution.
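Here is one way to start; mu_guess and sigma_guess are placeholder values for you to tweak until the two curves overlap:
In [ ]:
mu_guess, sigma_guess = 9, 1  # placeholder guesses; tweak these
fitted = norm.rvs(mu_guess, sigma_guess, size=10000)
sns.kdeplot(sample, color='blue', label='sample')
sns.kdeplot(fitted, color='green', label='fitted')
plt.legend()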
The key point here is that the sample is a random sample of the population: the mean computed from the sample will not necessarily match the actual population mean.
The central limit theorem is central to statistics.
Quoting from Wikipedia:
In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
We're going to experimentally look at what this statement means.
In [17]:
xs = list(range(1, 7))  # the faces 1 through 6
ps = [0.4, 0.1, 0, 0.2, 0.3, 0]  # their probabilities; these must sum to 1
crazy_dice = rv_discrete(name='crazy_dice', values=(xs, ps))
Instructor demo: Let's visualize this distribution.
In [18]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(xs, crazy_dice.pmf(xs), 'ro', ms=12)  # probability mass at each face
ax.vlines(xs, [0]*len(xs), crazy_dice.pmf(xs), color='red', lw=4)  # stems
Out[18]:
With this die, we are essentially saying that we expect to throw 1s 40% of the time, 5s 30% of the time, 4s 20% of the time, and 2s 10% of the time, with no expectation of throwing 3s or 6s.
Now, let's roll the die 4 times. What's the mean?
In [78]:
# Class activity
sample1 = crazy_dice.rvs(size=4)  # 4 rolls of the crazy die
sns.kdeplot(sample1)
np.mean(sample1)
Out[78]:
Roll the die 4 times again. What's the mean?
In [20]:
# Class activity
sample2 = crazy_dice.rvs(size=4)
np.mean(sample2)
Out[20]:
Roll the die 4 times, but now repeat that 1000 times. What's the distribution of the means?
In [84]:
# Class activity
means = []
for i in range(1000):
    mean = np.mean(crazy_dice.rvs(size=4))  # mean of one set of 4 rolls
    means.append(mean)
# plt.hist(np.array(means), color='green')
sns.kdeplot(np.array(means), color='blue')  # density of the 1000 sample means
sns.kdeplot(norm.rvs(2.9, 0.6, size=100000), color='red')  # an eyeballed Normal approximation of that density
Out[84]:
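The CLT in fact predicts the parameters of that red curve: the means of 4 rolls should be approximately Normal with mean E[X] and standard deviation sigma/sqrt(4), where sigma is the standard deviation of a single roll. The sd of 0.6 above was eyeballed; here is the predicted value:
In [ ]:
ex = np.dot(xs, ps)               # E[X] = 2.9
ex2 = np.dot(np.square(xs), ps)   # E[X^2] = 11.5
sigma = np.sqrt(ex2 - ex ** 2)    # sd of a single roll, about 1.76
ex, sigma / np.sqrt(4)            # CLT-predicted mean and sd of the mean of 4 rolls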
In [ ]:
Try calculating by hand the expected value of the die. The definition of the expected value is:

$$\sum_{i=1}^{n} X_i p_i$$

Or in plain English: sum up the product of each value and its probability.
Did you get 2.9?
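To check the hand calculation, here is the same sum as a dot product:
In [ ]:
# expected value: 1(0.4) + 2(0.1) + 3(0) + 4(0.2) + 5(0.3) + 6(0) = 2.9
np.dot(xs, ps)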
In [ ]:
What is statistics all about? At its core, it is about quantifying how probable our data would be under a model of random chance, and about finding the model parameters that best describe the data.
Also, the Central Limit Theorem explains much of what we observe about the world: averages of many independent effects tend to look Normal, regardless of the underlying distribution.
A note on parametric vs. non-parametric tests: use a parametric test when you have a known mechanistic process for the data; when the model correctly describes the underlying process, it gives more accurate results than a non-parametric test.
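For illustration (again my addition): a two-sample t-test is a parametric counterpart of the permutation test we ran above, since it assumes each group is Normally distributed:
In [ ]:
from scipy.stats import ttest_ind

# Welch's t-test (does not assume equal variances); drop the NaN padding first
t_stat, p = ttest_ind(data['Control'].dropna(),
                      data['Inhibitor 1'].dropna(),
                      equal_var=False)
p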
In [ ]: